library(tidyverse)
library(dplyr)
library(ggplot2)
library(usmap)
library(ggthemes)
MATT PUT IN DATA INFO HERE (when corona cases stopped tracking, what is pov dataset)
covidData <- read.csv("coronacounties.csv")
povertyData <- read.csv("PovertyEstimates.csv")
populationData <- usmap::countypop
populationData$fips <- as.integer(populationData$fips)
First, let’s start by glimpsing the 2 datasets.
glimpse(covidData)
## Observations: 61,971
## Variables: 6
## $ date <fct> 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, 2020-01-24, 20…
## $ county <fct> Snohomish, Snohomish, Snohomish, Cook, Snohomish, Orange, Cook…
## $ state <fct> Washington, Washington, Washington, Illinois, Washington, Cali…
## $ fips <int> 53061, 53061, 53061, 17031, 53061, 6059, 17031, 53061, 4013, 6…
## $ cases <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ deaths <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
covidData %>%
group_by(fips) %>%
count()
## # A tibble: 2,709 x 2
## # Groups: fips [2,709]
## fips n
## <int> <int>
## 1 1001 23
## 2 1003 33
## 3 1005 13
## 4 1007 17
## 5 1009 22
## 6 1011 21
## 7 1013 22
## 8 1015 29
## 9 1017 28
## 10 1019 22
## # … with 2,699 more rows
At first glimpse, the COVID-19 dataset has 61,971 observations and 6 variables. However, after grouping by FIPS code, we see that the dataset contains data on 2,709 counties, with repeated observations representing cases on different dates.
glimpse(povertyData)
## Observations: 3,193
## Variables: 34
## $ FIPStxt <int> 0, 1000, 1001, 1003, 1005, 1007, 1009…
## $ Stabr <fct> US, AL, AL, AL, AL, AL, AL, AL, AL, A…
## $ Area_name <fct> United States, Alabama, Autauga Count…
## $ Rural.urban_Continuum_Code_2003 <int> NA, NA, 2, 4, 6, 1, 1, 6, 6, 3, 6, 8,…
## $ Urban_Influence_Code_2003 <int> NA, NA, 2, 5, 6, 1, 1, 6, 6, 2, 5, 6,…
## $ Rural.urban_Continuum_Code_2013 <int> NA, NA, 2, 3, 6, 1, 1, 6, 6, 3, 6, 6,…
## $ Urban_Influence_Code_2013 <int> NA, NA, 2, 2, 6, 1, 1, 6, 6, 2, 5, 6,…
## $ POVALL_2018 <fct> "41,852,315", "801,758", "7,587", "21…
## $ CI90LBAll_2018 <fct> "41,619,366", "785,668", "6,334", "17…
## $ CI90UBALL_2018 <fct> "42,085,264", "817,848", "8,840", "24…
## $ PCTPOVALL_2018 <dbl> 13.1, 16.8, 13.8, 9.8, 30.9, 21.8, 13…
## $ CI90LBALLP_2018 <dbl> 13.0, 16.5, 11.5, 8.1, 25.8, 17.1, 10…
## $ CI90UBALLP_2018 <dbl> 13.2, 17.1, 16.1, 11.5, 36.0, 26.5, 1…
## $ POV017_2018 <fct> "12,997,532", "255,613", "2,509", "6,…
## $ CI90LB017_2018 <fct> "12,873,127", "247,744", "1,965", "4,…
## $ CI90UB017_2018 <fct> "13,121,937", "263,482", "3,053", "8,…
## $ PCTPOV017_2018 <dbl> 18.0, 23.9, 19.3, 13.9, 43.9, 27.8, 1…
## $ CI90LB017P_2018 <dbl> 17.8, 23.2, 15.1, 10.2, 35.0, 20.7, 1…
## $ CI90UB017P_2018 <dbl> 18.2, 24.6, 23.5, 17.6, 52.8, 34.9, 2…
## $ POV517_2018 <fct> "8,930,152", "178,175", "1,891", "4,5…
## $ CI90LB517_2018 <fct> "8,834,521", "171,349", "1,469", "3,2…
## $ CI90UB517_2018 <fct> "9,025,783", "185,001", "2,313", "5,8…
## $ PCTPOV517_2018 <dbl> 17.0, 22.8, 19.5, 13.1, 36.7, 26.3, 1…
## $ CI90LB517P_2018 <dbl> 16.8, 21.9, 15.1, 9.3, 27.5, 19.0, 10…
## $ CI90UB517P_2018 <dbl> 17.2, 23.7, 23.9, 16.9, 45.9, 33.6, 2…
## $ MEDHHINC_2018 <fct> "61,937", "49,881", "59,338", "57,588…
## $ CI90LBINC_2018 <fct> "61,843", "49,123", "53,628", "54,437…
## $ CI90UBINC_2018 <fct> "62,031", "50,639", "65,048", "60,739…
## $ POV04_2018 <fct> "3,758,704", "73,915", "", "", "", ""…
## $ CI90LB04_2018 <fct> "3,714,862", "69,990", "", "", "", ""…
## $ CI90UB04_2018 <fct> "3,802,546", "77,840", "", "", "", ""…
## $ PCTPOV04_2018 <dbl> 19.5, 26.0, NA, NA, NA, NA, NA, NA, N…
## $ CI90LB04P_2018 <dbl> 19.3, 24.6, NA, NA, NA, NA, NA, NA, N…
## $ CI90UB04P_2018 <dbl> 19.7, 27.4, NA, NA, NA, NA, NA, NA, N…
The Poverty dataset has 3,193 observations (counties, states, and the US) and 34 variables.
Both the COVID-19 dataset and the Poverty dataset have a variation of the variable fips, which is the Federal Information Processing Standard [https://en.wikipedia.org/wiki/FIPS_county_code] that allows counties to be uniquely identified.
Thus, the FIPS variable is a good variable by which to join these 2 datasets:
data <- merge(covidData, povertyData, by.x = 'fips', by.y = 'FIPStxt')
glimpse(data)
## Observations: 61,164
## Variables: 39
## $ fips <int> 1001, 1001, 1001, 1001, 1001, 1001, 1…
## $ date <fct> 2020-03-27, 2020-04-07, 2020-03-25, 2…
## $ county <fct> Autauga, Autauga, Autauga, Autauga, A…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, A…
## $ cases <int> 6, 12, 4, 12, 6, 19, 10, 25, 23, 7, 1…
## $ deaths <int> 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ Stabr <fct> AL, AL, AL, AL, AL, AL, AL, AL, AL, A…
## $ Area_name <fct> Autauga County, Autauga County, Autau…
## $ Rural.urban_Continuum_Code_2003 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Urban_Influence_Code_2003 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Rural.urban_Continuum_Code_2013 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Urban_Influence_Code_2013 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ POVALL_2018 <fct> "7,587", "7,587", "7,587", "7,587", "…
## $ CI90LBAll_2018 <fct> "6,334", "6,334", "6,334", "6,334", "…
## $ CI90UBALL_2018 <fct> "8,840", "8,840", "8,840", "8,840", "…
## $ PCTPOVALL_2018 <dbl> 13.8, 13.8, 13.8, 13.8, 13.8, 13.8, 1…
## $ CI90LBALLP_2018 <dbl> 11.5, 11.5, 11.5, 11.5, 11.5, 11.5, 1…
## $ CI90UBALLP_2018 <dbl> 16.1, 16.1, 16.1, 16.1, 16.1, 16.1, 1…
## $ POV017_2018 <fct> "2,509", "2,509", "2,509", "2,509", "…
## $ CI90LB017_2018 <fct> "1,965", "1,965", "1,965", "1,965", "…
## $ CI90UB017_2018 <fct> "3,053", "3,053", "3,053", "3,053", "…
## $ PCTPOV017_2018 <dbl> 19.3, 19.3, 19.3, 19.3, 19.3, 19.3, 1…
## $ CI90LB017P_2018 <dbl> 15.1, 15.1, 15.1, 15.1, 15.1, 15.1, 1…
## $ CI90UB017P_2018 <dbl> 23.5, 23.5, 23.5, 23.5, 23.5, 23.5, 2…
## $ POV517_2018 <fct> "1,891", "1,891", "1,891", "1,891", "…
## $ CI90LB517_2018 <fct> "1,469", "1,469", "1,469", "1,469", "…
## $ CI90UB517_2018 <fct> "2,313", "2,313", "2,313", "2,313", "…
## $ PCTPOV517_2018 <dbl> 19.5, 19.5, 19.5, 19.5, 19.5, 19.5, 1…
## $ CI90LB517P_2018 <dbl> 15.1, 15.1, 15.1, 15.1, 15.1, 15.1, 1…
## $ CI90UB517P_2018 <dbl> 23.9, 23.9, 23.9, 23.9, 23.9, 23.9, 2…
## $ MEDHHINC_2018 <fct> "59,338", "59,338", "59,338", "59,338…
## $ CI90LBINC_2018 <fct> "53,628", "53,628", "53,628", "53,628…
## $ CI90UBINC_2018 <fct> "65,048", "65,048", "65,048", "65,048…
## $ POV04_2018 <fct> , , , , , , , , , , , , , , , , , , ,…
## $ CI90LB04_2018 <fct> , , , , , , , , , , , , , , , , , , ,…
## $ CI90UB04_2018 <fct> , , , , , , , , , , , , , , , , , , ,…
## $ PCTPOV04_2018 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CI90LB04P_2018 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CI90UB04P_2018 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
state_cases <- data %>%
filter(date == "2020-04-15") %>%
group_by(state) %>%
summarise(total_cases = sum(cases))
cases_map <- plot_usmap(data = state_cases, values = "total_cases", labels = TRUE, label_color = "gray", color = "red")
cases_map <- cases_map +
scale_fill_gradient(low = "white", high = "#CB454A", name = "Cases") +
labs(title = "Cases by State on April 15, 2020") +
labs(subtitle = "Most cases are concentrated in New York State.") +
theme(legend.position = "right")
cases_map
Need to add histograms for the response variable we choose, the predictors we choose, and plots for the response v. predictors relationships (We can do this in the section following - response variable description and predictor descriptions)
county_cases <- data %>%
filter(date == "2020-04-15")
county_cases <- merge(x = populationData, y = county_cases, by = "fips", all = TRUE)
county_cases$log_cases <- log(county_cases$cases)
county_cases$cases[is.na(county_cases$cases)] <- 0
county_cases$log_cases[is.na(county_cases$log_cases)] <- 0
county_cases_map <- plot_usmap("counties", data = county_cases, values = "cases", color = "red", size=0.01)
county_cases_map <- county_cases_map +
scale_fill_gradient(low = "white", high = "#CB454A", name = " Cases") +
labs(title = "Cases by County on April 15, 2020") +
labs(subtitle = "Most cases are concentrated in New York State.") +
theme(legend.position = "right")
county_cases_map
county_cases_map <- plot_usmap("counties", data = county_cases, values = "log_cases", color = "red", size=0.01)
county_cases_map <- county_cases_map +
scale_fill_gradient(low = "white", high = "#CB454A", name = "Log Cases") +
labs(title = "Log Cases by County on April 15, 2020") +
labs(subtitle = "Most cases are concentrated in New York State.") +
theme(legend.position = "right")
county_cases_map
plot_poverty_data <- data <- merge(populationData, povertyData, by.x = 'fips', by.y = 'FIPStxt')
county_cases_map <- plot_usmap("counties", data = plot_poverty_data, values = "PCTPOVALL_2018", color = "red", size=0.01)
county_cases_map <- county_cases_map +
scale_fill_gradient(low = "white", high = "#CB454A", name = "PCTPOVALL_2018") +
labs(title = "Poverty Level by County (2018 values)") +
theme(legend.position = "right")
county_cases_map
We should look at histograms of both cases and log(cases) to see if the apparent more even spread of log(cases) is normal compared to regular cases
county_cases %>%
ggplot(mapping = aes(x = cases)) +
geom_histogram(binwidth = 600) +
labs(title = "Histogram of Number of Cases per County", x = "Number of Cases",
y = "# Counties with X Cases")
As we predicted, we see an incredible amount of right skewness in the distribution of cases per county.
county_cases %>%
ggplot(mapping = aes(x = log(cases))) +
geom_histogram(binwidth = 0.5) +
labs(title = "Histogram of Number of log of # Cases per County", x = "Number of Cases",
y = "# Counties with log(X) Cases")
## Warning: Removed 450 rows containing non-finite values (stat_bin).
This historam of the log(cases) per county shows much more evidence of normailty than the graph of just cases. It would be a much better response variable, however we should analyze some other options as well.
Another possible option to analyze as a response would be the number of cases per capita of a county. This may work better as it would correlate significantly less with population as we have “accounted” for this already in the response.
county_cases <- county_cases %>%
mutate(casesPC = cases/pop_2015)
casesPC_map <- plot_usmap("counties", data = county_cases, values = "casesPC", color = "red", size=0.01)
casesPC_map <- casesPC_map +
scale_fill_gradient(low = "white", high = "limegreen", name = "Cases Per Capita") +
labs(title = "Cases per Capita (by county)") +
theme(legend.position = "right")
casesPC_map
county_cases %>%
ggplot(mapping = aes(x = casesPC)) +
geom_histogram(binwidth = 0.0001) +
labs(title = "Histogram of Number of Cases/Capita per County", x = "Number of Cases/Capita",
y = "# Counties with X Cases per Capita")
Similarly to the map of cases per county, we see a large discrepancy between counties and may benefit from mapping the log of the cases per capita in order to better normalize the distribution. The histogram confirms this theory as it is majorly right skewed.
county_cases$log_casesPC <- log(county_cases$casesPC)
county_cases$casesPC[is.na(county_cases$casesPC)] <- 0
county_cases$log_casesPC[is.na(county_cases$log_casesPC)] <- 0
logcasesPC_map <- plot_usmap("counties", data = county_cases, values = "log_casesPC", color = "red", size=0.01)
logcasesPC_map <- logcasesPC_map +
scale_fill_gradient(low = "white", high = "limegreen", name = "Log of Cases per Capita") +
labs(title = "Log of Cases per Capita (by county)") +
theme(legend.position = "right")
logcasesPC_map
county_cases %>%
ggplot(mapping = aes(x = log_casesPC)) +
geom_histogram(binwidth = 0.3) +
labs(title = "Histogram of Number of log(Cases/Capita per County)", x = "Number of log(Cases/Capita)",
y = "# Counties with log(X) Cases per Capita")
## Warning: Removed 450 rows containing non-finite values (stat_bin).
We see a much more even distribution using the log of cases per capita. It is important to note that many counties appear as grey in this map; these counties have zero cases and thus the log of them returns the value of negative infinity causing this. We can remove or standardize them in our later analysis. The histogram of the log of cases per capita is very normal which would be perfect for analysis.
Ideas for regression - Linear regression: predict from multiple factors (FIP code, poverty rate, race breakdown, etc.) their number of cases or deaths - Logistic regression: predict if factors contribute to a death or not (don’t think we have specific data on specific cases so this could be challenging)
Assumptions for regression - Independence - Linearity - Normality - Constant Variance
This website has specific graphs that would be helpful to replicate, but don’t know where to get the data on an aggregate level: [https://www.apmresearchlab.org/covid/deaths-by-race#black]